Airbnb is a prime example of a disruptive innovation, that is now one of the largest marketplaces for accomodation with over 7 million properties in more than 220 countries. With this project we sought to utilize scraped data from Airbnb listings to carry out statistical analyses and ultimately predict the total cost for two people staying four nights in the city of Munich, Germany. After initial cleaning and wrangling of the dataset, an exploratory data analysis (EDA) was carried out to investigate existing relationships between variables, especially within and between price, neighbourhood / region, room and property type, as well as reviews and cancellation policy. As will be explained in greater detail below, we grouped the neighbourhoods within zones based on both personal experience and an official map of zones based on accomodation quality and price from the city of Munich. Key observations within our EDA were that there is a heavily right skewed distribution of price and reviews, and that no linear relationship regarding price could be observed; this led us to use the log of the total price for 4 days going forward with our regression. We progressively improved our model of regression by investigating the effect of all variables as displayed through t- and p-values. Our final and best model includes the most extensive list of variables, including for example the addition of logical variables for the only two significant amenities (elevator and shampoo). We ultimately arrived at an adjusted R-squared value of around 40%. Given that the correlation matrix and other early analyses showed rather weak / limited relationships between variables, we believe this is a good result based on the given dataset. Lastly, plots of residuals (i.e. QQ-plot, residuals vs. fitted) as well as variation inflation factor analyses showed that all assumptions of a linear regression (L-I-N-E) were met. Source
(Chunk 1: Introduction of project)
Importing original data and libraries we need:
library(tidyverse) # Load ggplot2, dplyr, and all the other tidyverse packages
library(mosaic)
library(ggthemes)
library(GGally)
library(readxl)
library(here)
library(skimr)
library(janitor)
library(broom)
library(infer)
library(vroom)
library(stringr)
library(leaflet)
library(broom)
library(huxtable)
library(ggfortify)
library(rsample)listings <- vroom("http://data.insideairbnb.com/germany/bv/munich/2020-06-20/data/listings.csv.gz") %>%
clean_names()
#glimpse(listings) # checking variable headersWe first select all potentially relevant variables from our data frame. The data is cleaned into number or factor to begin Exploratory Data Analysis (EDA). The raw dataset we create here is called “munich_listings”.
#Selecting all the relevant variables
munich_listings<- listings %>%
select(id,
host_is_superhost,
host_listings_count,
neighbourhood_cleansed,
latitude,
longitude,
property_type,
room_type,
accommodates,
bathrooms,
bedrooms,
beds,
bed_type,
#square_feet, we noticed that a lot of values are missing so excluded this variable
price,
security_deposit,
cleaning_fee,
guests_included,
extra_people,
minimum_nights,
maximum_nights,
number_of_reviews,
reviews_per_month,
review_scores_rating,
review_scores_accuracy,
review_scores_cleanliness,
review_scores_checkin,
review_scores_communication,
review_scores_location,
review_scores_value,
is_location_exact,
amenities,
instant_bookable,
cancellation_policy,
availability_365,
availability_90,
last_review,
listing_url,
last_scraped) %>%
#Converting characters to "doubles" and factors where appropriate
mutate(neighbourhood_cleansed=factor(neighbourhood_cleansed),
property_type,
room_type=factor(room_type),
price=parse_number(price),
security_deposit=parse_number(security_deposit),
cleaning_fee=parse_number(cleaning_fee),
extra_people=parse_number(extra_people),
cancellation_policy=factor(cancellation_policy),
bed_type=factor(bed_type),
amenities_count= str_count(listings$amenities, ","))
#Inspecting data frame to make sure all the variables are correctly attributed
glimpse(munich_listings) ## Rows: 11,172
## Columns: 39
## $ id <dbl> 36720, 97945, 114695, 127383, 157808, 1...
## $ host_is_superhost <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, ...
## $ host_listings_count <dbl> 1, 1, 3, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, ...
## $ neighbourhood_cleansed <fct> Ludwigsvorstadt-Isarvorstadt, Hadern, B...
## $ latitude <dbl> 48.1, 48.1, 48.1, 48.2, 48.2, 48.1, 48....
## $ longitude <dbl> 11.6, 11.5, 11.6, 11.6, 11.6, 11.5, 11....
## $ property_type <chr> "Apartment", "Apartment", "Apartment", ...
## $ room_type <fct> Entire home/apt, Entire home/apt, Entir...
## $ accommodates <dbl> 2, 2, 5, 4, 2, 3, 4, 2, 2, 2, 2, 1, 16,...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
## $ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ beds <dbl> 1, 1, 3, 1, 1, 1, 2, 1, 1, 0, 1, 1, 0, ...
## $ bed_type <fct> Futon, Real Bed, Real Bed, Real Bed, Re...
## $ price <dbl> 95, 80, 95, 120, 35, 55, 55, 65, 54, 67...
## $ security_deposit <dbl> 100, NA, 500, NA, 100, 0, 200, NA, 190,...
## $ cleaning_fee <dbl> 30, 10, 60, 28, 10, 60, 20, NA, 32, NA,...
## $ guests_included <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, ...
## $ extra_people <dbl> 30, 10, 50, 0, 15, 30, 15, 0, 0, 0, 0, ...
## $ minimum_nights <dbl> 2, 2, 2, 2, 1, 3, 2, 3, 1, 2, 3, 2, 1, ...
## $ maximum_nights <dbl> 730, 90, 30, 14, 36, 90, 1125, 14, 4, 3...
## $ number_of_reviews <dbl> 25, 131, 53, 84, 0, 33, 467, 64, 211, 8...
## $ reviews_per_month <dbl> 0.34, 1.23, 0.49, 0.76, NA, 0.31, 4.39,...
## $ review_scores_rating <dbl> 98, 97, 95, 98, NA, 93, 99, 91, 97, 97,...
## $ review_scores_accuracy <dbl> 10, 10, 9, 10, NA, 9, 10, 9, 10, 10, 10...
## $ review_scores_cleanliness <dbl> 10, 10, 10, 10, NA, 9, 10, 9, 10, 10, 9...
## $ review_scores_checkin <dbl> 10, 10, 10, 10, NA, 9, 10, 10, 10, 10, ...
## $ review_scores_communication <dbl> 10, 10, 10, 10, NA, 10, 10, 10, 10, 10,...
## $ review_scores_location <dbl> 10, 9, 9, 10, NA, 9, 10, 9, 10, 10, 10,...
## $ review_scores_value <dbl> 9, 9, 9, 10, NA, 9, 10, 9, 9, 10, 9, 10...
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,Kitchen...
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE...
## $ cancellation_policy <fct> strict_14_with_grace_period, flexible, ...
## $ availability_365 <dbl> 0, 82, 59, 6, 0, 142, 260, 90, 0, 111, ...
## $ availability_90 <dbl> 0, 2, 48, 6, 0, 4, 46, 90, 0, 43, 0, 89...
## $ last_review <date> 2017-07-22, 2019-10-03, 2019-10-06, 20...
## $ listing_url <chr> "https://www.airbnb.com/rooms/36720", "...
## $ last_scraped <date> 2020-06-21, 2020-06-20, 2020-06-21, 20...
## $ amenities_count <int> 10, 35, 36, 37, 24, 37, 32, 19, 31, 22,...
In “munich_listing”, we have 11172 items and 46 columns. Here are some noticeable changes we made: 1. neighbourhood_cleansed, room_type, cancellation_policy and bed_type are changed into factors. 1. price, security_deposit, cleaning_fee, extra_people and amenities_count are changed into numbers.
So now we have: host_is_superhost, is_location_exact, instant_bookable as logical variable neighbourhood_cleansed, room_type, bed_type, cancellation_policy as factor variable amenitiesm, property_type as character variable
We now create a new data frame called “munich_listings_cleaned” to do some required changes. Here, we deal with missing values/NAs, and clean the data for property type. Also, we filter the items upon min/max nights and accommodates for the 2 people to live for 4 nights.
#Filter dataset for two people and 4 nights
#Clean dataset for cleaning_fee, security_deposit, property_type, minimum_nights and accommodates
munich_listings_cleaned <- munich_listings %>%
mutate(cleaning_fee = case_when( #considering cleaning_fee as 0 if displayed as NA
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee),
security_deposit = case_when( #considering security_deposit as 0 if displayed as NA
is.na(security_deposit) ~ 0,
TRUE ~ security_deposit),
prop_type_simplified = case_when( #regrouping of property_types: put all less popular property types into "Other"
property_type %in% c("Apartment",
"House",
"Condominium",
"Loft")~ property_type ,
TRUE ~ "Other"),
prop_type_simplified=factor(prop_type_simplified)) %>% #creating factors
filter(minimum_nights<=4,
maximum_nights>=4,
accommodates>=2) #filtering dataframe for 2 people and 4 nights
#Visually inspecting cleaned data set
#glimpse(munich_listings_cleaned)
#skim(munich_listings_cleaned)For the NAs: We assume NA as 0 in cleaning fee and security deposit, which means we can book Airbnb without paying for these 2 services. So we didn’t make deletion here.
For property_type: We arranged the data set and find the top 5 kinds of Airbnbs in Munich, which are Apartment, House, Condominium, Loft and others. We transferred the variable into factors.
Filtering: We filter the room with minimum_night and maximum_night so that they can be booked for a 4-night stay. Also, the room should accommodate at least 2 people.
Then, we construct the formula for the total price of 4 days into data frame munich_listings_total_price: We create total_price_4_days as our target variable for regression representing total price of 4-night stay of two people. Note that the if_else statement allows us to include the option of adding 1 extra guest to an AirBnB that has accommodates = 1. The final multiplier of 1.142 is the 14.2% (service fee)[https://www.airbnb.co.uk/help/article/1857/what-are-airbnb-service-fees#:~:text=Host%20service%20fee&text=This%20fee%20is%20calculated%20from,deducted%20from%20the%20host%20payout] for AirBnB bookings that the company charges per booking.
We now create a new data frame called “munich_listings_region” grouping the Airbnbs geographically and making some changes for the subsequent analysis.
Three variable classes are created: region: grouped into 5 by the average price of each neighborhood rating_group: grouped into 3 by whether the rating is over 90 Amenities: Many different amenity words were checked for significance, only two remained. Interestingly theyre shampoo and elevator.
munich_listings_region <- munich_listings_total_price %>%
mutate(
region = case_when( #creating variable that clusters neighbourhoods for further analysis
neighbourhood_cleansed=="Altstadt-Lehel"~"zone_1",
neighbourhood_cleansed=="Ludwigsvorstadt-Isarvorstadt"~"zone_1",
neighbourhood_cleansed=="Maxvorstadt"~"zone_1",
neighbourhood_cleansed=="Schwabing-West"~"zone_2",
neighbourhood_cleansed=="Au-Haidhausen"~"zone_2",
neighbourhood_cleansed=="Sendling"~"zone_2",
neighbourhood_cleansed=="Sendling-Westpark"~"zone_2",
neighbourhood_cleansed=="Schwanthalerhöhe"~"zone_1",
neighbourhood_cleansed=="Neuhausen-Nymphenburg"~"zone_3",
neighbourhood_cleansed=="Moosach"~"zone_5",
neighbourhood_cleansed=="Milbertshofen-Am Hart"~"zone_5",
neighbourhood_cleansed=="Schwabing-Freimann"~"zone_3",
neighbourhood_cleansed=="Bogenhausen"~"zone_4",
neighbourhood_cleansed=="Berg am Laim"~"zone_4",
neighbourhood_cleansed=="Tudering-Riem"~"zone_1",
neighbourhood_cleansed=="Ramersdorf-Perlach"~"zone_5",
neighbourhood_cleansed=="Obergiesing"~"zone_2",
neighbourhood_cleansed=="Untergiesing-Harlaching"~"zone_4",
neighbourhood_cleansed=="Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln"~"zone_3",
neighbourhood_cleansed=="Hadern"~"zone_5",
neighbourhood_cleansed=="Pasing-Obermenzing"~"zone_3",
neighbourhood_cleansed=="Aubing-Lochhausen-Langwied"~"zone_4",
neighbourhood_cleansed=="Allach-Untermenzing"~"zone_3",
neighbourhood_cleansed=="Feldmoching-Hasenbergl"~"zone_3",
neighbourhood_cleansed=="Laim"~"zone_5"
),
rating_group= case_when( #clustering review_scores_rating to 2 groups
review_scores_rating <90 ~ "Under 90",
TRUE ~ "Over 90"),
# is_pool=case_when(
# grepl("Pool", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_gym=case_when(
# grepl("Gym", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_private_entrance=case_when(
# grepl("Private entrance", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_balcony=case_when(
# grepl("balcony", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_kitchen=case_when(
# grepl("Kitchen", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
is_elevator=case_when( # turned out to be significant
grepl("Elevator",
amenities,
fixed=TRUE) ~ TRUE,
TRUE ~FALSE),
# is_washer=case_when(
# grepl("Washer", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_dryer=case_when(
# grepl("Dryer", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_free_parking=case_when(
# grepl("Free parking on premises", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_paid_parking=case_when(
# grepl("Paid parking off premises", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_essentials=case_when(
# grepl("Essentials", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
is_shampoo=case_when( #turned out to be significant
grepl("Shampoo",
amenities,
fixed=TRUE) ~ TRUE,
TRUE ~FALSE))
# is_host_greets_you=case_when(
# grepl("Host greets you", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE),
# is_garden=case_when(
# grepl("Garden or backyard", amenities, fixed=TRUE) ~ TRUE,
# TRUE ~FALSE))
munich_listings_region <- munich_listings_region %>% #cleaning dataframe from all the missing values
na.omit()Here are description of the key variables in our dataset:
dependent variable:
total_price_4_days
independent variable:
property_type: type of accommodation (House, Apartment, etc.)
room_type: - Entire home/apt (guests have entire place to themselves)
Private room (Guests have private room to sleep, all other rooms shared)
Shared room (Guests sleep in room shared with others)
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
longitude , latitude: geographical coordinates to help us locate the listing
region: factor. Region the Airbnb is at grouping by house price. factored 1-5 from high price to low price
prop_type_simplified: type of accommodation (House, Apartment, Loft, Condominium)
room_type:Entire home/apt, Private room, Shared room
number_of_reviews: Total number of reviews for the listing
reviews_per_month: Number of reviews per month
review_scores_: Rating for in reviews in different aspects
rating_group: Average review score (0 - 100) grouped by 90
longitude , latitude: geographical coordinates to help us locate the listing
region: factor. Region the Airbnb is at grouping by house price. factored 1-5 from high price to low price
availability_365: Available days in the last 365 days
is_elevator and is_shampoo: Whether there is elevator or shampoo facilitated
Now that we have cleaned our data sets for our specific target (4 nights, 2 people) we will conduct a exploratory data analysis.
##Summary statistics and favstats
## id host_is_superhost host_listings_count
## Min. : 36720 Mode :logical Min. : 0
## 1st Qu.:12333597 FALSE:5045 1st Qu.: 1
## Median :22583936 TRUE :980 Median : 1
## Mean :22560646 Mean : 2
## 3rd Qu.:33438176 3rd Qu.: 2
## Max. :43697251 Max. :652
##
## neighbourhood_cleansed latitude longitude
## Ludwigsvorstadt-Isarvorstadt: 717 Min. :48.1 Min. :11.4
## Maxvorstadt : 666 1st Qu.:48.1 1st Qu.:11.5
## Schwabing-West : 446 Median :48.1 Median :11.6
## Neuhausen-Nymphenburg : 434 Mean :48.1 Mean :11.6
## Au-Haidhausen : 408 3rd Qu.:48.2 3rd Qu.:11.6
## Schwabing-Freimann : 339 Max. :48.2 Max. :11.7
## (Other) :3015
## property_type room_type accommodates bathrooms
## Length:6025 Entire home/apt:3544 Min. : 2.00 Min. :0.0
## Class :character Hotel room : 34 1st Qu.: 2.00 1st Qu.:1.0
## Mode :character Private room :2374 Median : 2.00 Median :1.0
## Shared room : 73 Mean : 2.92 Mean :1.1
## 3rd Qu.: 4.00 3rd Qu.:1.0
## Max. :16.00 Max. :4.5
##
## bedrooms beds bed_type price
## Min. : 0.00 Min. : 0.00 Airbed : 10 Min. : 11
## 1st Qu.: 1.00 1st Qu.: 1.00 Couch : 30 1st Qu.: 56
## Median : 1.00 Median : 1.00 Futon : 30 Median : 83
## Mean : 1.11 Mean : 1.61 Pull-out Sofa: 162 Mean : 116
## 3rd Qu.: 1.00 3rd Qu.: 2.00 Real Bed :5793 3rd Qu.: 128
## Max. :20.00 Max. :30.00 Max. :9000
##
## security_deposit cleaning_fee guests_included extra_people
## Min. : 0 Min. : 0.0 Min. : 1.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 1.0 1st Qu.: 0.0
## Median : 0 Median : 19.0 Median : 1.0 Median : 10.0
## Mean : 169 Mean : 22.4 Mean : 1.5 Mean : 15.4
## 3rd Qu.: 200 3rd Qu.: 35.0 3rd Qu.: 2.0 3rd Qu.: 25.0
## Max. :4500 Max. :300.0 Max. :200.0 Max. :250.0
##
## minimum_nights maximum_nights number_of_reviews reviews_per_month
## Min. :1.00 Min. : 4 Min. : 1 Min. : 0.01
## 1st Qu.:1.00 1st Qu.: 17 1st Qu.: 2 1st Qu.: 0.12
## Median :2.00 Median : 1125 Median : 6 Median : 0.32
## Mean :1.88 Mean : 653 Mean : 22 Mean : 0.69
## 3rd Qu.:2.00 3rd Qu.: 1125 3rd Qu.: 18 3rd Qu.: 0.77
## Max. :4.00 Max. :111360 Max. :688 Max. :16.36
##
## review_scores_rating review_scores_accuracy review_scores_cleanliness
## Min. : 20.0 Min. : 2.00 Min. : 2.00
## 1st Qu.: 93.0 1st Qu.:10.00 1st Qu.: 9.00
## Median : 97.0 Median :10.00 Median :10.00
## Mean : 94.6 Mean : 9.67 Mean : 9.43
## 3rd Qu.:100.0 3rd Qu.:10.00 3rd Qu.:10.00
## Max. :100.0 Max. :10.00 Max. :10.00
##
## review_scores_checkin review_scores_communication review_scores_location
## Min. : 2.00 Min. : 2.0 Min. : 2.00
## 1st Qu.:10.00 1st Qu.:10.0 1st Qu.: 9.00
## Median :10.00 Median :10.0 Median :10.00
## Mean : 9.75 Mean : 9.8 Mean : 9.63
## 3rd Qu.:10.00 3rd Qu.:10.0 3rd Qu.:10.00
## Max. :10.00 Max. :10.0 Max. :10.00
##
## review_scores_value is_location_exact amenities instant_bookable
## Min. : 2.00 Mode :logical Length:6025 Mode :logical
## 1st Qu.: 9.00 FALSE:1177 Class :character FALSE:4096
## Median : 9.00 TRUE :4848 Mode :character TRUE :1929
## Mean : 9.25
## 3rd Qu.:10.00
## Max. :10.00
##
## cancellation_policy availability_365 availability_90
## flexible :2029 Min. : 0 Min. : 0.0
## moderate :2081 1st Qu.: 0 1st Qu.: 0.0
## strict : 0 Median : 0 Median : 0.0
## strict_14_with_grace_period:1913 Mean : 70 Mean :23.9
## super_strict_30 : 1 3rd Qu.: 90 3rd Qu.:55.0
## super_strict_60 : 1 Max. :365 Max. :90.0
##
## last_review listing_url last_scraped amenities_count
## Min. :2012-10-04 Length:6025 Min. :2020-06-20 Min. : 0.0
## 1st Qu.:2018-09-30 Class :character 1st Qu.:2020-06-20 1st Qu.:12.0
## Median :2019-10-04 Mode :character Median :2020-06-20 Median :17.0
## Mean :2019-03-01 Mean :2020-06-20 Mean :18.6
## 3rd Qu.:2020-01-06 3rd Qu.:2020-06-21 3rd Qu.:25.0
## Max. :2020-06-21 Max. :2020-06-21 Max. :82.0
##
## prop_type_simplified total_price_4_days region rating_group
## Apartment :5276 Min. : 44 Length:6025 Length:6025
## Condominium: 178 1st Qu.: 279 Class :character Class :character
## House : 215 Median : 392 Mode :character Mode :character
## Loft : 91 Mean : 513
## Other : 265 3rd Qu.: 568
## Max. :36050
##
## is_elevator is_shampoo
## Mode :logical Mode :logical
## FALSE:3349 FALSE:2471
## TRUE :2676 TRUE :3554
##
##
##
##
#running favstats on some interesting variable combinations and keeping the most interesting ones
favstats(price~accommodates, data=munich_listings_region) | accommodates | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 15 | 50 | 70 | 100 | 999 | 84.9 | 65.5 | 3503 | 0 |
| 3 | 12 | 62 | 90 | 130 | 1e+03 | 109 | 79 | 850 | 0 |
| 4 | 11 | 79 | 110 | 180 | 8e+03 | 153 | 292 | 1123 | 0 |
| 5 | 35 | 93.8 | 144 | 200 | 1.12e+03 | 182 | 147 | 196 | 0 |
| 6 | 39 | 95.2 | 180 | 300 | 1e+03 | 231 | 191 | 198 | 0 |
| 7 | 34 | 89 | 135 | 190 | 700 | 176 | 129 | 43 | 0 |
| 8 | 25 | 129 | 249 | 450 | 995 | 324 | 254 | 61 | 0 |
| 9 | 65 | 125 | 226 | 288 | 950 | 311 | 325 | 6 | 0 |
| 10 | 25 | 196 | 294 | 612 | 1.45e+03 | 437 | 375 | 18 | 0 |
| 11 | 149 | 262 | 475 | 2.74e+03 | 9e+03 | 2.52e+03 | 4.32e+03 | 4 | 0 |
| 12 | 125 | 280 | 350 | 585 | 800 | 421 | 231 | 9 | 0 |
| 13 | 39 | 39 | 39 | 39 | 39 | 39 | 1 | 0 | |
| 14 | 185 | 242 | 300 | 360 | 420 | 302 | 118 | 3 | 0 |
| 16 | 35 | 35 | 35 | 111 | 839 | 145 | 250 | 10 | 0 |
| neighbourhood_cleansed | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Allach-Untermenzing | 18 | 42 | 75 | 110 | 530 | 111 | 113 | 35 | 0 |
| Altstadt-Lehel | 25 | 80 | 120 | 180 | 800 | 153 | 111 | 222 | 0 |
| Au-Haidhausen | 25 | 60 | 85 | 120 | 1.45e+03 | 115 | 116 | 408 | 0 |
| Aubing-Lochhausen-Langwied | 16 | 41.5 | 65 | 149 | 380 | 99 | 81.3 | 56 | 0 |
| Berg am Laim | 25 | 55 | 76.5 | 131 | 400 | 103 | 75.7 | 110 | 0 |
| Bogenhausen | 23 | 57 | 80 | 120 | 500 | 97.9 | 66.4 | 297 | 0 |
| Feldmoching-Hasenbergl | 25 | 45 | 62.5 | 98.2 | 350 | 88.4 | 68 | 74 | 0 |
| Hadern | 15 | 45 | 79 | 100 | 350 | 84.4 | 58.4 | 73 | 0 |
| Laim | 20 | 50 | 80 | 121 | 585 | 100 | 82.8 | 216 | 0 |
| Ludwigsvorstadt-Isarvorstadt | 28 | 70 | 100 | 150 | 9e+03 | 172 | 499 | 717 | 0 |
| Maxvorstadt | 28 | 65 | 90 | 140 | 999 | 122 | 107 | 666 | 0 |
| Milbertshofen-Am Hart | 12 | 49 | 70 | 100 | 400 | 86 | 58.6 | 249 | 0 |
| Moosach | 25 | 50 | 70 | 100 | 800 | 101 | 113 | 103 | 0 |
| Neuhausen-Nymphenburg | 21 | 52.2 | 79.5 | 120 | 899 | 104 | 83.7 | 434 | 0 |
| Obergiesing | 15 | 50 | 80 | 130 | 700 | 109 | 98.9 | 213 | 0 |
| Pasing-Obermenzing | 21 | 46 | 70 | 125 | 800 | 105 | 104 | 119 | 0 |
| Ramersdorf-Perlach | 15 | 45 | 60 | 90 | 420 | 75 | 49.4 | 215 | 0 |
| Schwabing-Freimann | 20 | 55 | 80 | 120 | 1e+03 | 106 | 101 | 339 | 0 |
| Schwabing-West | 11 | 56.2 | 80 | 120 | 1e+03 | 107 | 89.6 | 446 | 0 |
| Schwanthalerhöhe | NaN | 0 | 0 | ||||||
| Sendling | 25 | 59.2 | 90 | 135 | 590 | 113 | 88.5 | 258 | 0 |
| Sendling-Westpark | 20 | 52 | 80 | 120 | 990 | 109 | 105 | 208 | 0 |
| Thalkirchen-Obersendling-Forstenried-Fürstenried-Solln | 25 | 50 | 75 | 120 | 1.12e+03 | 98.8 | 93.5 | 211 | 0 |
| Tudering-Riem | 30 | 50 | 75 | 120 | 999 | 127 | 152 | 161 | 0 |
| Untergiesing-Harlaching | 28 | 60 | 80 | 120 | 500 | 107 | 77.7 | 195 | 0 |
| host_is_superhost | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| FALSE | 11 | 59 | 85 | 130 | 9e+03 | 119 | 210 | 5045 | 0 |
| TRUE | 18 | 50 | 75 | 110 | 890 | 99.4 | 86.9 | 980 | 0 |
| prop_type_simplified | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Apartment | 11 | 58 | 85 | 125 | 8e+03 | 112 | 159 | 5276 | 0 |
| Condominium | 19 | 55 | 89.5 | 154 | 995 | 140 | 150 | 178 | 0 |
| House | 20 | 45 | 65 | 100 | 890 | 96.8 | 108 | 215 | 0 |
| Loft | 35 | 75 | 98 | 140 | 9e+03 | 232 | 938 | 91 | 0 |
| Other | 20 | 53 | 88 | 130 | 999 | 145 | 188 | 265 | 0 |
| minimum_nights | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 15 | 52 | 80 | 120 | 8e+03 | 113 | 200 | 2257 | 0 |
| 2 | 11 | 59 | 85 | 125 | 9e+03 | 114 | 219 | 2529 | 0 |
| 3 | 15 | 60 | 89 | 144 | 1.45e+03 | 126 | 130 | 935 | 0 |
| 4 | 23 | 60 | 90 | 140 | 800 | 114 | 82.7 | 304 | 0 |
From the summary and favstats investigations, we have decided to conduct further exploratory data analysis through ggplot2. We will first build a correlation martix to spot the relationships between the particular variables.
munich_listing_is_numeric<-munich_listings_region[,sapply(munich_listings_region,is.numeric),with=FALSE]%>%
na.omit() #We have created a dataframe that contains only numerical variables from our original dataframe in order to build the Correlation Matrix.
corMatrix <- as.data.frame(cor(munich_listing_is_numeric))
corMatrix$var1 <- rownames(corMatrix)
corMatrix2 <- corMatrix %>%
gather(key = var2, value = r, 1:28) # selecting coloumns from dataframe
ggplot(corMatrix2,aes(x = var1, y = var2, fill = r)) +
geom_tile() +
geom_text(aes(label = round(r, 2)), size = 6) +
scale_fill_gradient2(low = "#ff585d", #adding colour to matrix
high = "#00bf6f",
mid = "white") +
labs(title = "Correlation Matrix") +
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
size=16),
axis.text.y=element_text(size=16)) ## Further analysis for collinear variables
munich_listing_is_numeric[,18:24]%>% #We tried to spot the correlation between the review-related variables using ggpairs plot
ggpairs()munich_listing_is_numeric%>% #We used the ggpairs plot to further analyse the bottom left part of our correlation matrix
select(accommodates,bathrooms,
bedrooms,
beds,
cleaning_fee,
extra_people,
guests_included,
total_price_4_days,
security_deposit)%>%
ggpairs()The correlation matrix above displays two key ‘green zones’ where there are moderate to strong correlations present between variables. In the upper right corner, the plot illustrates the positive correlations between the various review score components, indicating that when an Airbnb scores well on one criterium it will tend to also have a higher rating on the other criteria. The strongest correlatio here is between the total review score and the review score for accuracy, at a level of 0.74. In the lower left corner we can see positive correlations between variables ranging from weak to strong. As one would expect, the number of people an Airbnb in Munich accomodates has a strong positive correlation with the number of beds and the number of bedrooms. There is a moderatore positive correlation between the total accomodated and the cleaning fee. Lastly, there is a moderate positive correlation between the cleaning fee and the security deposit, likely attributable to the fact that these properties are of a higher standard, as is mentioned on Airbnb’s website (deposits are usually based on a home’s features).
Looking at our independent variable of interest for this project, the total price for a 4-day stay for two people, we only find weak positive correlations when disregarding the obvious connection to daily price. With a level of 0.29 there is a weak to moderate positive correlation between the total price and the number of people an Airbnb can accommodate; this is further supported by weak correlations (0.22) between total price and the number of bedrooms and beds. We will now continue to investigate relationships between our variables, in particular categorical variables not included in the above matrix.
ggplot(listings,aes(x=number_of_reviews))+
geom_histogram(binwidth = 4)+
xlim(0,250)+
ylim(0,1000)+
labs(title="Most of the airbnb accomodations have up to 20 reviews",
x="Number of Reviews",
y="Quantity")+
theme_bw()ggplot(munich_listing_is_numeric,
aes(x=extra_people,y=total_price_4_days))+
geom_point()+
geom_smooth(method="lm")+
ylim(0,3000)+
xlim(0,100)+
labs(title="Higher the Extra People Charge, the Higher the Overall Price",
x="Price per extra person",
y="Total price for 4 nights")+
theme_bw()#heavily right-skewed
ggplot(munich_listings_region, aes(x=total_price_4_days))+
geom_density(bins=20)+
xlim(0,4000) +
labs(title="The density plot of total price for 4 nights is heavily right-skwed",
x="Density",
y="Total price for 4 nights")+
theme_bw()#log also heavily right skewed
ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
geom_density(bins=20)+
scale_x_log10()+
xlim(0,2500) +
labs(title="Logarithmic Total Price Shows Nature of Price Clusters",
x="Density",
y="Total price for 4 nights")+
theme_bw()#histogram to show frequency
ggplot(munich_listings_total_price, aes(x=total_price_4_days))+
geom_histogram(bins=100)+
xlim(0,2500)+
labs(title="Most Airbnbs cost around €300 for 4 Nights",
x="Total price for 4 nights",
y= "Quantity")+
theme_bw()munich_listings_region %>%
group_by(room_type) %>%
summarize(mean_price_roomtype = mean(total_price_4_days)) %>%
arrange(desc(mean_price_roomtype)) %>%
ggplot(aes(y=reorder(room_type, mean_price_roomtype), x = mean_price_roomtype)) +
geom_col() +
labs(title="What are The Most Expensive Airbnb's in Munich?",
subtitle="Hotels! who would have thought?",
x="Average price for 4 nights per room",
y="Room type")+
theme_bw()#Calculated mean price for 4 nights per room type
munich_listings_region %>%
group_by(neighbourhood_cleansed) %>%
summarize(mean_price_neighbourhood = mean(total_price_4_days)) %>%
arrange(desc(mean_price_neighbourhood)) %>%
ggplot(aes(y=reorder(neighbourhood_cleansed, mean_price_neighbourhood), x=mean_price_neighbourhood)) +
geom_col()+
labs(title="Average price for 4 nights per in particular neighbourhoods",
x="Average price for per room",
y="Neighbourhood")+
theme_bw()#Calculated mean price for 4 nights per neighbourhood
munich_listings_region %>%
group_by(prop_type_simplified) %>%
summarize(mean_price_property = mean(total_price_4_days)) %>%
arrange(desc(mean_price_property)) %>%
ggplot(aes(y=reorder(prop_type_simplified, mean_price_property), x = mean_price_property)) +
geom_col() +
labs(title="Lofts Come at a Premium in Munich, Houses Present a Good Value Proposition",
x="Average price for per room",
y="Property type")+
theme_bw()#Calculated mean price for 4 nights per property type
munich_listings_region %>%
group_by(prop_type_simplified) %>%
mutate(count_property=count("Apartment")) %>%
arrange((count_property)) %>%
ggplot(aes(x=reorder(prop_type_simplified, desc(count_property)), y = count_property)) +
geom_col() +
labs(title="Apartments Dominate Airbnb's Listings",
x="Property type",
y="Quantity")+
theme_bw()#Calculated count of particular property types
munich_listings_region %>%
group_by(cancellation_policy) %>%
ggplot(aes(x=reorder(cancellation_policy,total_price_4_days ), y = total_price_4_days)) +
geom_boxplot() +
labs(title="Average prices per 4 nights for an Airbnb according to particular cancellation policies",
y="Price",
x="Cancellation policy")+
scale_y_log10(limits=c(100,10000))+
theme_bw()Now, we will conduct the mapping of our locations on the Munich map. We decided to colour our data in regards to a particular zone they are located in, to have a better sense of the density of the accommodation in these zones. The zones were grouped by highest mean rental price, since it created the largest significance in our models later on.
pallette <- colorFactor(c("red", "blue", "green", "yellow","purple"), domain = c("zone_1", "zone_2", "zone_3", "zone_4","zone_5"))
leaflet(data = munich_listings_region) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 2,
color = ~pallette(region),
fillColor = ~region,
group = ~ region,
clusterId=~region,
fillOpacity = 0.4,
popup = ~listing_url,
label = ~paste( prop_type_simplified, "Min nights", "=", minimum_nights))Now we will start building our models. We will start from models with only a few variables and we will gradually try to build the model with the best fitting data and the biggest possible adjusted R-squared value. Running each model, we will as well check the colinearity analysis to cut confounding variables. For that reason we will use `car::vif(model_x)`` to calculate the Variance Inflation Factor (VIF) for our predictors. A general guideline is that a VIF larger than 5 or 10 is large, and our model may suffer from colinearity. We will remove the variable in question and run our model again without it if such a VIF occurs.
For our models we will use the log value of total_prices_4_days since the distribution of it is more bell shaped than the regular value and thus will be better descried by our model.
We will fit our first regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
#Regression using log because normally distributed.
model1 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
review_scores_rating,
data=munich_listings_region)
msummary(model1)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.081471 0.084086 72.32 < 2e-16 ***
## prop_type_simplifiedCondominium 0.146014 0.044434 3.29 0.0010 **
## prop_type_simplifiedHouse -0.111465 0.040595 -2.75 0.0061 **
## prop_type_simplifiedLoft 0.286691 0.061669 4.65 3.4e-06 ***
## prop_type_simplifiedOther 0.109228 0.036817 2.97 0.0030 **
## number_of_reviews -0.001290 0.000157 -8.23 2.3e-16 ***
## review_scores_rating -0.000471 0.000883 -0.53 0.5942
##
## Residual standard error: 0.583 on 6018 degrees of freedom
## Multiple R-squared: 0.0181, Adjusted R-squared: 0.0172
## F-statistic: 18.5 on 6 and 6018 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.01 4 1
## number_of_reviews 1.01 1 1
## review_scores_rating 1.01 1 1
#Noticed that variable review_scores_rating and "Other" and "House" categories in prop_type_simplified are also insignificant. Dropping review_scores_rating.After running model1, we can notice, that “review_scores_rating” is insignificant for our linear regression model as the p-value is bigger than 0.05.Therefore we will drop it. Our dummy variable “prop_type_simplified” turned out to be insignificant for Houses and Other property types. Anyway, we will keep the variable prop_type_simplified as some of it’s variables are important for our model. Our Adjusted R-squared in this model is only 2,25%. We will try to fit more variables in our model in order to increase the accuracy.
We will add as well an example of interpretation of our data in logarithmic lm model.
The coefficient interpretation of review_scores_rating in regards to total_price_4_days is as follows: If the review_scores_rating increases by one, the total_price_4_days decreases by 0,0003%.
The coefficient interpretation of prop_type_simplified in regards to total_price_4_days is as follows: In regards to a particular property type the total_price_4_days behaves as follows: - (property type: Apartment) : total_price_4_days just takes the “Intercept” variable and increases by 6,08%. - (property type: Condominium) : prop_type_simplifiedCondominium=1; total_price_4_days increases by 0.18%. - (property type: House): prop_type_simplifiedHouse=1; total_price_4_days decreases by 0,065%. - (property type: Loft): prop_type_simplifiedLoft=1; total_price_4_days increases by 0.301%. - (property type: Other): prop_type_simplifiedOther=1; total_price_4_days increases by 0.06%.
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We will fit a regression model that includes all of the explanatory variables in model1 plus room_type.
model2 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
review_scores_rating+
room_type,
data=munich_listings_region)
msummary(model2)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.149860 0.080197 76.68 < 2e-16 ***
## prop_type_simplifiedCondominium 0.115210 0.042271 2.73 0.00644 **
## prop_type_simplifiedHouse 0.043474 0.039020 1.11 0.26526
## prop_type_simplifiedLoft 0.197614 0.058641 3.37 0.00076 ***
## prop_type_simplifiedOther 0.112799 0.037391 3.02 0.00257 **
## number_of_reviews -0.001201 0.000149 -8.07 8.6e-16 ***
## review_scores_rating 0.000333 0.000842 0.40 0.69229
## room_typeHotel room 0.309529 0.101913 3.04 0.00240 **
## room_typePrivate room -0.377170 0.014942 -25.24 < 2e-16 ***
## room_typeShared room -0.242816 0.065713 -3.70 0.00022 ***
##
## Residual standard error: 0.553 on 6015 degrees of freedom
## Multiple R-squared: 0.116, Adjusted R-squared: 0.115
## F-statistic: 87.8 on 9 and 6015 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.20 4 1.02
## number_of_reviews 1.01 1 1.01
## review_scores_rating 1.02 1 1.01
## room_type 1.20 3 1.03
The room_type has increased our adjusted R-squared up to 0.13. The p-value for each room type is less than 0,05, thus the room type variable is important and we will keep it in our model.
Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights?
model3 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
room_type+
bathrooms+
bedrooms+
beds+
accommodates,
data=munich_listings_region)
msummary(model3)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.567317 0.026569 209.54 < 2e-16 ***
## prop_type_simplifiedCondominium 0.081447 0.038663 2.11 0.0352 *
## prop_type_simplifiedHouse -0.103669 0.036075 -2.87 0.0041 **
## prop_type_simplifiedLoft 0.110756 0.053773 2.06 0.0395 *
## prop_type_simplifiedOther 0.009137 0.034343 0.27 0.7902
## number_of_reviews -0.001494 0.000136 -10.96 < 2e-16 ***
## room_typeHotel room 0.538304 0.093237 5.77 8.2e-09 ***
## room_typePrivate room -0.248967 0.014497 -17.17 < 2e-16 ***
## room_typeShared room -0.286653 0.059913 -4.78 1.8e-06 ***
## bathrooms 0.121782 0.023689 5.14 2.8e-07 ***
## bedrooms 0.052133 0.013074 3.99 6.8e-05 ***
## beds -0.024632 0.008246 -2.99 0.0028 **
## accommodates 0.146740 0.006864 21.38 < 2e-16 ***
##
## Residual standard error: 0.504 on 6012 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.263
## F-statistic: 180 on 12 and 6012 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.27 4 1.03
## number_of_reviews 1.02 1 1.01
## room_type 1.36 3 1.05
## bathrooms 1.22 1 1.11
## bedrooms 1.96 1 1.40
## beds 2.49 1 1.58
## accommodates 2.52 1 1.59
All the variables in our model apart from “beds” variable ware significant as t-value of these variables is more than 2. In our further models we will keep “bedrooms”, “bathrooms” and “accommodates”, however we will drop the “beds”.
Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
model4 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
room_type+
bathrooms+
bedrooms+
accommodates+
host_is_superhost,
data=munich_listings_region)
msummary(model4)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.579593 0.026354 211.72 < 2e-16 ***
## prop_type_simplifiedCondominium 0.076031 0.038645 1.97 0.0492 *
## prop_type_simplifiedHouse -0.107762 0.036075 -2.99 0.0028 **
## prop_type_simplifiedLoft 0.108861 0.053816 2.02 0.0431 *
## prop_type_simplifiedOther 0.000282 0.034238 0.01 0.9934
## number_of_reviews -0.001474 0.000142 -10.38 < 2e-16 ***
## room_typeHotel room 0.540528 0.093317 5.79 7.3e-09 ***
## room_typePrivate room -0.247167 0.014536 -17.00 < 2e-16 ***
## room_typeShared room -0.291356 0.059946 -4.86 1.2e-06 ***
## bathrooms 0.117383 0.023657 4.96 7.2e-07 ***
## bedrooms 0.038997 0.012329 3.16 0.0016 **
## accommodates 0.136191 0.005883 23.15 < 2e-16 ***
## host_is_superhostTRUE -0.012966 0.018451 -0.70 0.4823
##
## Residual standard error: 0.505 on 6012 degrees of freedom
## Multiple R-squared: 0.263, Adjusted R-squared: 0.262
## F-statistic: 179 on 12 and 6012 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.25 4 1.03
## number_of_reviews 1.10 1 1.05
## room_type 1.37 3 1.05
## bathrooms 1.22 1 1.10
## bedrooms 1.74 1 1.32
## accommodates 1.85 1 1.36
## host_is_superhost 1.10 1 1.05
Superhosts do not command a pricing premium in Munich, therefore we will drop this variable in our further models. We can see that the VIF for bedrooms and accommodates has a bit higher VIF, however it is still not high enough to worry about it.
Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?
model5 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
room_type+bathrooms+
bedrooms+accommodates+
is_location_exact,
data=munich_listings_region)
msummary(model5)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.581507 0.028925 192.96 < 2e-16 ***
## prop_type_simplifiedCondominium 0.075406 0.038658 1.95 0.0512 .
## prop_type_simplifiedHouse -0.108313 0.036068 -3.00 0.0027 **
## prop_type_simplifiedLoft 0.108223 0.053811 2.01 0.0444 *
## prop_type_simplifiedOther 0.000280 0.034239 0.01 0.9935
## number_of_reviews -0.001500 0.000137 -10.99 < 2e-16 ***
## room_typeHotel room 0.539701 0.093317 5.78 7.7e-09 ***
## room_typePrivate room -0.247933 0.014506 -17.09 < 2e-16 ***
## room_typeShared room -0.291236 0.059974 -4.86 1.2e-06 ***
## bathrooms 0.117470 0.023677 4.96 7.2e-07 ***
## bedrooms 0.039067 0.012329 3.17 0.0015 **
## accommodates 0.136165 0.005883 23.15 < 2e-16 ***
## is_location_exactTRUE -0.003974 0.016473 -0.24 0.8094
##
## Residual standard error: 0.505 on 6012 degrees of freedom
## Multiple R-squared: 0.263, Adjusted R-squared: 0.262
## F-statistic: 179 on 12 and 6012 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.25 4 1.03
## number_of_reviews 1.02 1 1.01
## room_type 1.36 3 1.05
## bathrooms 1.22 1 1.10
## bedrooms 1.74 1 1.32
## accommodates 1.85 1 1.36
## is_location_exact 1.01 1 1.00
The variable “is_location_exact” does not have a significant influence on the price of an Airbnb in Munich (p-value bigger than 0.05). Therefore, we will drop it.
Now we will use a variable that we created - “region” that clusters all the neighbourhood to 5 zones and we will see how the location affects the price for Airbnb in our model.
model6 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
room_type+
bathrooms+
bedrooms+
accommodates+
region,
data=munich_listings_region)
msummary(model6)## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.713665 0.027665 206.53 < 2e-16 ***
## prop_type_simplifiedCondominium 0.067621 0.037781 1.79 0.07353 .
## prop_type_simplifiedHouse -0.051486 0.035537 -1.45 0.14745
## prop_type_simplifiedLoft 0.112969 0.052587 2.15 0.03174 *
## prop_type_simplifiedOther 0.014181 0.033591 0.42 0.67291
## number_of_reviews -0.001600 0.000133 -11.99 < 2e-16 ***
## room_typeHotel room 0.487240 0.091290 5.34 9.8e-08 ***
## room_typePrivate room -0.233688 0.014208 -16.45 < 2e-16 ***
## room_typeShared room -0.266269 0.058622 -4.54 5.7e-06 ***
## bathrooms 0.113342 0.023132 4.90 9.8e-07 ***
## bedrooms 0.042178 0.012056 3.50 0.00047 ***
## accommodates 0.136502 0.005750 23.74 < 2e-16 ***
## regionzone_2 -0.134851 0.017317 -7.79 8.0e-15 ***
## regionzone_3 -0.191581 0.018510 -10.35 < 2e-16 ***
## regionzone_4 -0.194712 0.022643 -8.60 < 2e-16 ***
## regionzone_5 -0.331023 0.020700 -15.99 < 2e-16 ***
##
## Residual standard error: 0.493 on 6009 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.295
## F-statistic: 169 on 15 and 6009 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.28 4 1.03
## number_of_reviews 1.02 1 1.01
## room_type 1.37 3 1.05
## bathrooms 1.22 1 1.10
## bedrooms 1.75 1 1.32
## accommodates 1.85 1 1.36
## region 1.04 4 1.01
The region of Munich has a significant influence on the price. T-value of all the zone is way more than |2| and our adjusted R-squared went up - it suggests that model 6 better describes the real data than our previous models.
What is the effect of cancellation_policy on price_4_nights, after we control for other variables?
model7 <-lm(log(total_price_4_days)~prop_type_simplified+
number_of_reviews+
room_type+
bathrooms+
bedrooms+
accommodates+
region+
cancellation_policy,
data=munich_listings_region)
msummary(model7)## Estimate Std. Error t value
## (Intercept) 5.684569 0.028818 197.26
## prop_type_simplifiedCondominium 0.058108 0.037529 1.55
## prop_type_simplifiedHouse -0.058809 0.035306 -1.67
## prop_type_simplifiedLoft 0.113839 0.052225 2.18
## prop_type_simplifiedOther 0.030525 0.033403 0.91
## number_of_reviews -0.001666 0.000133 -12.52
## room_typeHotel room 0.495942 0.090674 5.47
## room_typePrivate room -0.227447 0.014170 -16.05
## room_typeShared room -0.279824 0.058242 -4.80
## bathrooms 0.108761 0.022977 4.73
## bedrooms 0.044928 0.011978 3.75
## accommodates 0.130261 0.005750 22.66
## regionzone_2 -0.129482 0.017208 -7.52
## regionzone_3 -0.182387 0.018411 -9.91
## regionzone_4 -0.188113 0.022498 -8.36
## regionzone_5 -0.321764 0.020590 -15.63
## cancellation_policymoderate 0.002518 0.015381 0.16
## cancellation_policystrict_14_with_grace_period 0.132608 0.015965 8.31
## cancellation_policysuper_strict_30 0.131256 0.491330 0.27
## cancellation_policysuper_strict_60 0.025796 0.490214 0.05
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.12160
## prop_type_simplifiedHouse 0.09582 .
## prop_type_simplifiedLoft 0.02931 *
## prop_type_simplifiedOther 0.36084
## number_of_reviews < 2e-16 ***
## room_typeHotel room 4.7e-08 ***
## room_typePrivate room < 2e-16 ***
## room_typeShared room 1.6e-06 ***
## bathrooms 2.3e-06 ***
## bedrooms 0.00018 ***
## accommodates < 2e-16 ***
## regionzone_2 6.1e-14 ***
## regionzone_3 < 2e-16 ***
## regionzone_4 < 2e-16 ***
## regionzone_5 < 2e-16 ***
## cancellation_policymoderate 0.86997
## cancellation_policystrict_14_with_grace_period < 2e-16 ***
## cancellation_policysuper_strict_30 0.78937
## cancellation_policysuper_strict_60 0.95804
##
## Residual standard error: 0.49 on 6005 degrees of freedom
## Multiple R-squared: 0.307, Adjusted R-squared: 0.305
## F-statistic: 140 on 19 and 6005 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.29 4 1.03
## number_of_reviews 1.03 1 1.01
## room_type 1.39 3 1.06
## bathrooms 1.22 1 1.11
## bedrooms 1.75 1 1.32
## accommodates 1.87 1 1.37
## region 1.05 4 1.01
## cancellation_policy 1.06 4 1.01
The cancellation policy of 14 days seems to have a significant impact on the price for 4 nights. This is why we will keep the variable “cancellation policy” in our model. Our Adjusted R-squared again went up by one percent. Let us keep trying adding more variables that may turn out significant for our model.
Now we will create the model with numerous significant data that we checked to be relevant and significant to create our best fitting regression model.
model_wild_west<-lm(log10(total_price_4_days)~ #predicting total_price_4_days on variables below
prop_type_simplified+
number_of_reviews* #multiplied because of colinearity
reviews_per_month+
room_type* # multiplied because of colinearity
bedrooms+
bathrooms+
accommodates+
region+
cancellation_policy+
review_scores_value+
review_scores_cleanliness+
review_scores_checkin+
review_scores_location+
security_deposit+
rating_group+
instant_bookable+
availability_365+
availability_90+
maximum_nights+
minimum_nights+
is_elevator+
is_shampoo,
data=munich_listings_region)
msummary(model_wild_west) ## Estimate Std. Error t value
## (Intercept) 2.35e+00 5.21e-02 45.09
## prop_type_simplifiedCondominium 2.03e-02 1.54e-02 1.32
## prop_type_simplifiedHouse -2.52e-02 1.47e-02 -1.71
## prop_type_simplifiedLoft 4.56e-02 2.14e-02 2.13
## prop_type_simplifiedOther 1.87e-02 1.39e-02 1.34
## number_of_reviews -1.01e-03 1.28e-04 -7.86
## reviews_per_month -4.74e-02 3.97e-03 -11.94
## room_typeHotel room 2.74e-01 6.36e-02 4.30
## room_typePrivate room -5.22e-04 1.07e-02 -0.05
## room_typeShared room -1.32e-01 2.41e-02 -5.48
## bedrooms 5.58e-02 5.56e-03 10.02
## bathrooms 4.55e-02 9.43e-03 4.82
## accommodates 5.00e-02 2.38e-03 21.01
## regionzone_2 -5.66e-02 7.07e-03 -8.00
## regionzone_3 -7.67e-02 7.61e-03 -10.08
## regionzone_4 -7.96e-02 9.27e-03 -8.58
## regionzone_5 -1.28e-01 8.61e-03 -14.88
## cancellation_policymoderate 5.87e-03 6.37e-03 0.92
## cancellation_policystrict_14_with_grace_period 4.40e-02 6.69e-03 6.58
## cancellation_policysuper_strict_30 -4.29e-02 2.01e-01 -0.21
## cancellation_policysuper_strict_60 -6.90e-02 2.00e-01 -0.34
## review_scores_value -4.61e-02 3.47e-03 -13.28
## review_scores_cleanliness 2.21e-02 3.30e-03 6.70
## review_scores_checkin 1.27e-02 4.25e-03 2.99
## review_scores_location 2.11e-02 4.22e-03 4.99
## security_deposit 4.84e-05 7.27e-06 6.65
## rating_groupUnder 90 -3.06e-02 9.32e-03 -3.28
## instant_bookableTRUE 3.66e-02 5.77e-03 6.35
## availability_365 1.39e-04 3.59e-05 3.86
## availability_90 7.09e-04 1.12e-04 6.35
## maximum_nights 3.34e-06 1.58e-06 2.12
## minimum_nights -1.58e-02 3.26e-03 -4.84
## is_elevatorTRUE 1.32e-02 5.39e-03 2.44
## is_shampooTRUE 1.04e-02 5.38e-03 1.94
## number_of_reviews:reviews_per_month 1.87e-04 2.43e-05 7.69
## room_typeHotel room:bedrooms -1.04e-01 4.78e-02 -2.18
## room_typePrivate room:bedrooms -1.03e-01 8.32e-03 -12.40
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.18710
## prop_type_simplifiedHouse 0.08652 .
## prop_type_simplifiedLoft 0.03314 *
## prop_type_simplifiedOther 0.17900
## number_of_reviews 4.5e-15 ***
## reviews_per_month < 2e-16 ***
## room_typeHotel room 1.7e-05 ***
## room_typePrivate room 0.96121
## room_typeShared room 4.5e-08 ***
## bedrooms < 2e-16 ***
## bathrooms 1.4e-06 ***
## accommodates < 2e-16 ***
## regionzone_2 1.4e-15 ***
## regionzone_3 < 2e-16 ***
## regionzone_4 < 2e-16 ***
## regionzone_5 < 2e-16 ***
## cancellation_policymoderate 0.35735
## cancellation_policystrict_14_with_grace_period 5.0e-11 ***
## cancellation_policysuper_strict_30 0.83092
## cancellation_policysuper_strict_60 0.73042
## review_scores_value < 2e-16 ***
## review_scores_cleanliness 2.3e-11 ***
## review_scores_checkin 0.00280 **
## review_scores_location 6.3e-07 ***
## security_deposit 3.2e-11 ***
## rating_groupUnder 90 0.00103 **
## instant_bookableTRUE 2.3e-10 ***
## availability_365 0.00011 ***
## availability_90 2.3e-10 ***
## maximum_nights 0.03416 *
## minimum_nights 1.3e-06 ***
## is_elevatorTRUE 0.01473 *
## is_shampooTRUE 0.05265 .
## number_of_reviews:reviews_per_month 1.7e-14 ***
## room_typeHotel room:bedrooms 0.02899 *
## room_typePrivate room:bedrooms < 2e-16 ***
##
## Residual standard error: 0.2 on 5988 degrees of freedom
## Multiple R-squared: 0.39, Adjusted R-squared: 0.386
## F-statistic: 106 on 36 and 5988 DF, p-value: <2e-16
model_wild_west_colinear<-lm(log10(total_price_4_days)~ #predicting total_price_4_days on variables below
prop_type_simplified+
number_of_reviews+ #linearised for colinearity
reviews_per_month+
room_type+ # linearised for colinearity
bedrooms+
bathrooms+
accommodates+
region+
cancellation_policy+
review_scores_value+
review_scores_cleanliness+
review_scores_checkin+
review_scores_location+
security_deposit+
rating_group+
instant_bookable+
availability_365+
availability_90+
maximum_nights+
minimum_nights+
is_elevator+
is_shampoo,
data=munich_listings_region)
car::vif(model_wild_west_colinear) # car VIF struggles with multiplied variables so a new unmultiplied model is used to check.## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.40 4 1.04
## number_of_reviews 2.47 1 1.57
## reviews_per_month 2.61 1 1.62
## room_type 1.62 3 1.08
## bedrooms 1.78 1 1.34
## bathrooms 1.23 1 1.11
## accommodates 1.93 1 1.39
## region 1.12 4 1.01
## cancellation_policy 1.14 4 1.02
## review_scores_value 1.80 1 1.34
## review_scores_cleanliness 1.69 1 1.30
## review_scores_checkin 1.49 1 1.22
## review_scores_location 1.42 1 1.19
## security_deposit 1.07 1 1.03
## rating_group 1.62 1 1.27
## instant_bookable 1.09 1 1.04
## availability_365 2.53 1 1.59
## availability_90 2.41 1 1.55
## maximum_nights 1.02 1 1.01
## minimum_nights 1.15 1 1.07
## is_elevator 1.07 1 1.04
## is_shampoo 1.05 1 1.03
In our final model we tested variables from the previous models that were significant and tested much more variables that in our opinion could as well affect the total_price_4_days. We tested the variables connected to review scores - i.e. review_scores_value, review_scores_cleanliness, review_scores_checking_ review_scores_location etc. Only the ones mentioned turned out to be significant for the model. Afterwards we check for security_deposit, rating_group, instants_bookable and availability variables. Two of them (availability_60 and availability_30) turned out to be insignificant, so we decided to drop them. Thereafter, we added host_listings_count as we believe that the number of properties the host has may affect the standard, build some economies of scales perhaps and therefore affect somehow the price. This factor as well turned out to be significant. Later we tested maximum_nights and minimum_nights. In the next step we were testing whether particular types of amenities have any significant impact on the price. It turned out that two of them - elevator and shampoo (as they are always part of some welcome packs) are also significant for the price’s prediction. Moreover, we added two interaction variables - room_type&bedrooms and number_of_reviews&review_per_month as we believe there is much interaction happening between them. Our final model has adjusted R-squared at the level of 38.7% and a RSE at the level of 0.2. Checking the VIF throughout, we can see that the GVIF value is well below 5 and we can be assured that the colinearity is not affecting our model significantly.
In the next step we will plot residuals, analyze their behaviour and check whether they are distributed within the norms. Afterwards we will compare all the models and compare how our models evolved.
# comparing significance of variables among model iterations
huxreg(model2,
model3,
model6,
model7,
model_wild_west)| (1) | (2) | (3) | (4) | (5) | |
|---|---|---|---|---|---|
| (Intercept) | 6.150 *** | 5.567 *** | 5.714 *** | 5.685 *** | 2.348 *** |
| (0.080) | (0.027) | (0.028) | (0.029) | (0.052) | |
| prop_type_simplifiedCondominium | 0.115 ** | 0.081 * | 0.068 | 0.058 | 0.020 |
| (0.042) | (0.039) | (0.038) | (0.038) | (0.015) | |
| prop_type_simplifiedHouse | 0.043 | -0.104 ** | -0.051 | -0.059 | -0.025 |
| (0.039) | (0.036) | (0.036) | (0.035) | (0.015) | |
| prop_type_simplifiedLoft | 0.198 *** | 0.111 * | 0.113 * | 0.114 * | 0.046 * |
| (0.059) | (0.054) | (0.053) | (0.052) | (0.021) | |
| prop_type_simplifiedOther | 0.113 ** | 0.009 | 0.014 | 0.031 | 0.019 |
| (0.037) | (0.034) | (0.034) | (0.033) | (0.014) | |
| number_of_reviews | -0.001 *** | -0.001 *** | -0.002 *** | -0.002 *** | -0.001 *** |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | 0.000 | ||||
| (0.001) | |||||
| room_typeHotel room | 0.310 ** | 0.538 *** | 0.487 *** | 0.496 *** | 0.274 *** |
| (0.102) | (0.093) | (0.091) | (0.091) | (0.064) | |
| room_typePrivate room | -0.377 *** | -0.249 *** | -0.234 *** | -0.227 *** | -0.001 |
| (0.015) | (0.014) | (0.014) | (0.014) | (0.011) | |
| room_typeShared room | -0.243 *** | -0.287 *** | -0.266 *** | -0.280 *** | -0.132 *** |
| (0.066) | (0.060) | (0.059) | (0.058) | (0.024) | |
| bathrooms | 0.122 *** | 0.113 *** | 0.109 *** | 0.045 *** | |
| (0.024) | (0.023) | (0.023) | (0.009) | ||
| bedrooms | 0.052 *** | 0.042 *** | 0.045 *** | 0.056 *** | |
| (0.013) | (0.012) | (0.012) | (0.006) | ||
| beds | -0.025 ** | ||||
| (0.008) | |||||
| accommodates | 0.147 *** | 0.137 *** | 0.130 *** | 0.050 *** | |
| (0.007) | (0.006) | (0.006) | (0.002) | ||
| regionzone_2 | -0.135 *** | -0.129 *** | -0.057 *** | ||
| (0.017) | (0.017) | (0.007) | |||
| regionzone_3 | -0.192 *** | -0.182 *** | -0.077 *** | ||
| (0.019) | (0.018) | (0.008) | |||
| regionzone_4 | -0.195 *** | -0.188 *** | -0.080 *** | ||
| (0.023) | (0.022) | (0.009) | |||
| regionzone_5 | -0.331 *** | -0.322 *** | -0.128 *** | ||
| (0.021) | (0.021) | (0.009) | |||
| cancellation_policymoderate | 0.003 | 0.006 | |||
| (0.015) | (0.006) | ||||
| cancellation_policystrict_14_with_grace_period | 0.133 *** | 0.044 *** | |||
| (0.016) | (0.007) | ||||
| cancellation_policysuper_strict_30 | 0.131 | -0.043 | |||
| (0.491) | (0.201) | ||||
| cancellation_policysuper_strict_60 | 0.026 | -0.069 | |||
| (0.490) | (0.200) | ||||
| reviews_per_month | -0.047 *** | ||||
| (0.004) | |||||
| review_scores_value | -0.046 *** | ||||
| (0.003) | |||||
| review_scores_cleanliness | 0.022 *** | ||||
| (0.003) | |||||
| review_scores_checkin | 0.013 ** | ||||
| (0.004) | |||||
| review_scores_location | 0.021 *** | ||||
| (0.004) | |||||
| security_deposit | 0.000 *** | ||||
| (0.000) | |||||
| rating_groupUnder 90 | -0.031 ** | ||||
| (0.009) | |||||
| instant_bookableTRUE | 0.037 *** | ||||
| (0.006) | |||||
| availability_365 | 0.000 *** | ||||
| (0.000) | |||||
| availability_90 | 0.001 *** | ||||
| (0.000) | |||||
| maximum_nights | 0.000 * | ||||
| (0.000) | |||||
| minimum_nights | -0.016 *** | ||||
| (0.003) | |||||
| is_elevatorTRUE | 0.013 * | ||||
| (0.005) | |||||
| is_shampooTRUE | 0.010 | ||||
| (0.005) | |||||
| number_of_reviews:reviews_per_month | 0.000 *** | ||||
| (0.000) | |||||
| room_typeHotel room:bedrooms | -0.104 * | ||||
| (0.048) | |||||
| room_typePrivate room:bedrooms | -0.103 *** | ||||
| (0.008) | |||||
| room_typeShared room:bedrooms | |||||
| N | 6025 | 6025 | 6025 | 6025 | 6025 |
| R2 | 0.116 | 0.264 | 0.297 | 0.307 | 0.390 |
| logLik | -4973.056 | -4419.501 | -4283.651 | -4238.792 | 1169.906 |
| AIC | 9968.112 | 8867.003 | 8601.301 | 8519.583 | -2263.812 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||
From the table comparing all the models we can spot, that our R-squared went up through out the process of finding the best solution. We can as well spot which variables were added and dropped at which stages.
##Model applyinh and predicting the outcome
Now, we will find a price of the Airbnbs that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90.
We are using our logarithmic model log(total_price_4_days) in the predict function since our regression is based in the log(total_price_4_days). First, we will create a new table that we will filter according to the conditions above. In the next step, we will anti-log our model_wild_west. At the end, we will predict the prices for our filtered accommodations and we will create for them the Confidence Intervals. We will do it in two ways in order to compare our scores.
munich_listings_predict<- munich_listings_region %>%
mutate(price=log(total_price_4_days)) %>% #converting to log form for prediction
filter(room_type=="Private room" &
number_of_reviews>=10 &
rating_group=="Over 90")
predict_df<-10^predict(model_wild_west, # converting from log form to nominal
newdata = munich_listings_predict,
interval= "confidence")
#sanity check
summary(predict_df)## fit lwr upr
## Min. :138 Min. :119 Min. :160
## 1st Qu.:255 1st Qu.:240 1st Qu.:271
## Median :297 Median :280 Median :315
## Mean :312 Mean :293 Mean :333
## 3rd Qu.:352 3rd Qu.:332 3rd Qu.:371
## Max. :788 Max. :643 Max. :993
#using broom augment
model_prediction <- broom::augment(model_wild_west,
newdata= munich_listings_predict)
model_prediction <- model_prediction %>%
mutate(lower_95=10^.fitted-1.96*abs(10^(.resid)),#creating 95% confidence interval
upper_95=10^.fitted+1.96*abs(10^(.resid))) %>%
select(.fitted,
lower_95,
upper_95,
total_price_4_days) %>%
mutate(.fitted=10^.fitted)
#sanity check
summary(model_prediction)## .fitted lower_95 upper_95 total_price_4_days
## Min. :138 Min. :135 Min. :140 Min. : 92
## 1st Qu.:255 1st Qu.:253 1st Qu.:257 1st Qu.: 215
## Median :297 Median :295 Median :299 Median : 275
## Mean :312 Mean :310 Mean :314 Mean : 324
## 3rd Qu.:352 3rd Qu.:350 3rd Qu.:353 3rd Qu.: 376
## Max. :788 Max. :788 Max. :789 Max. :4036
Using the predict and augment function we observe a mean price of around 315, which is close to the actual total mean price of 327. We also see that the 1st and 3rd quartiles for both of the prediction methods all line up. Differences appear in the lower and upper confidence level boundaries between the two functions, where the predict function’s interval actually captures the true mean, the augment misses it by €10. Despite this we can get a sense of confidence for our linear regression’s accuracy due to the tight spread and capturing of the true mean. The next step is conducting a sanity check by checking the RMSE of our model.
##Data Training and RMSE In the next step we will split our data into two parts. We will train one part and later test another one. In the next step we will compare the results.
set.seed(1234)
train_test_split <- initial_split(munich_listings_predict, prop=0.7) # splitting dataset
munich_train<- training(train_test_split)
munich_test<- testing(train_test_split)
rmse_train <- munich_train %>% #training portion for RMSE
mutate(predictions=predict(model_wild_west,.)) %>%
summarise(sqrt(sum(predictions-log(total_price_4_days))**2/n())) %>%
pull()
rmse_train## [1] 81.2
rmse_test <- munich_test %>%
mutate(predictions=predict(model_wild_west,.)) %>%
summarise(sqrt(sum(predictions-log(total_price_4_days))**2/n())) %>%
pull()
rmse_test## [1] 52.8
We can see that the RMSE is an order of magnitude below our prices, which confirms that though our R^2 is low, our accuracy is very high.
Thank you for your interest in our study project. We hope you found it interesting.
Authors Magdalena Cloppenburg, Yichun Hou, Derek Leung, Malay Memani, Samy Mohamad, Agnieszka Prawda.